This is the third assignment for the BCB420H1 course @ U of T. In this assignment, we will be performing non-thresholdeed GSEA on our ranked set of genes that we created in A2, and we will then visualize our GSEA using Cytoscape.
First, let’s load in all of our libraries for the first part of our analysis.
We are working with the GSE205517 dataset. This dataset and its accompanying study investigates the differentiation of hPSCs into left ventricular cardiomyocytes, and compares them to patient derived samples. (Dark et al. 2023) The study compares the transcriptomes of both conditions in order to determine similarity for potential therapeutic purposes. This study is a SuperSeries containing the SubSeries: - GSE203375, containing the hPSCs - GSE204885, containing the patient samples We will download both. In the previous assignment, we performed normalization on the RNA-Seq data, along with remapping HGNC symbols to avoid collisions. In this assignment, we will be perform differential gene expression (DGE) analysis, along with over-representation analysis (ORA). Before we begin these new steps, we will run the data acquisition, data overview, and normalization steps, and overview stats from assignment 1.
For A2, we continued with the cleaned up data and found DGEs using edgeR. We performed BH corrections, and we also displayed volcano plots to point out key genes (not pictured here). We then used heatmaps to show clustering, and then conducted ORA using g:Profiler for up and downregulated pathways. In terms of our findings:
First, let’s perform our data acquisition step using GEOQuery. (Davis and Meltzer 2007) We’ll be re-using code blocks from A1.
# GEO Accession numbers
geo_acc_1 <- "GSE203375"
geo_acc_2 <- "GSE204885"
# The filenames for saving/loading data
filename_1 <- paste0(geo_acc_1, ".RData")
filename_2 <- paste0(geo_acc_2, ".RData")
# Reading in files from the GEO or locally
if (!file.exists(filename_1)) {
gset_1 <- getGEO(geo_acc_1, GSEMatrix=TRUE, getGPL=FALSE)
saveRDS(gset_1, filename_1)
} else {
gset_1 <- readRDS(filename_1)
}
gset_1 <- gset_1[[1]]
if (!file.exists(filename_2)) {
gset_2 <- getGEO(geo_acc_2, GSEMatrix=TRUE, getGPL=FALSE)
saveRDS(gset_2, filename_2)
} else {
gset_2 <- readRDS(filename_2)
}
gset_2 <- gset_2[[1]]
We now have our data loaded in. We now need to retrieve the counts tables from the GEO.
# load counts table from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path_1 <- paste(urld, paste0("acc=", geo_acc_1), paste0("file=", geo_acc_1, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");
path_2 <- paste(urld, paste0("acc=", geo_acc_2), paste0("file=", geo_acc_2, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");
# This checks if we already have the previous matrices saved in or not. If we have a local copy, we don't re-download. This is a time-saving and efficiency measure.
if (!file.exists(paste0(geo_acc_1,"_raw_counts.RData"))) {
counts_data_1 <- as.matrix(data.table::fread(path_1, header=T, colClasses="integer"), rownames=1)
name_mapping <- setNames(gset_1@phenoData@data[["title"]], gset_1@phenoData@data[["geo_accession"]])
colnames(counts_data_1) <- name_mapping[colnames(counts_data_1)]
saveRDS(counts_data_1, paste0(geo_acc_1, "_raw_counts.RData"))
} else {
counts_data_1 <- readRDS(paste0(geo_acc_1, "_raw_counts.RData"))
}
# We do the same for our second dataset
if (!file.exists(paste0(geo_acc_2,"_raw_counts.RData"))) {
counts_data_2 <- as.matrix(data.table::fread(path_2, header=T, colClasses="integer"), rownames=1)
name_mapping <- setNames(gset_2@phenoData@data[["title"]], gset_2@phenoData@data[["geo_accession"]])
colnames(counts_data_2) <- name_mapping[colnames(counts_data_2)]
saveRDS(counts_data_2, paste0(geo_acc_2, "_raw_counts.RData"))
} else {
counts_data_2 <- readRDS(paste0(geo_acc_2, "_raw_counts.RData"))
}
counts_data <- cbind(counts_data_1, counts_data_2)
This excerpt is directly from A1: Some basic information about the dataset can be obtained through the ExpressionSet we’ve read in. The study investigated differentiation of two different cell lines, H9 of karyotype 46, XX, and MLC2V of karyotype 46, XY, corresponding to cells originating from female and male sources, respectively. It then compared to to patient samples harvested from left and right atrial and ventricular tissue, along with cardiomyocytes individually harvested from each section of the heart from three patients each.
Below are the experimental details provided in the gset objects for both series:
gset_1@experimentData@title # study title
## [1] "Bulk RNA-seq of a time series of hPSCs as they differentiate towards left ventricle cardiomyocytes"
gset_1@experimentData@abstract # study abstract
## [1] "We report that both hESCs and hiPSCs can be differentiated towrads left ventricle cardiomyocytes using an optimised protocol and that these cells transit via first heart field progenitors."
gset_1@experimentData@other$overall_design # summary of experiment
## [1] "Examination of the transcriptome of two hPSC lines as they differentiate towards left ventricle cardiomyocytes; critical time points were collected"
unique(gset_1@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_1@phenoData@data[["experiment number:ch1"]]) # heart
## [1] "h28" "h30" "h32" "h41" "h18" "h42"
unique(gset_1@phenoData@data[["cell_line:ch1"]]) # cell lines
## [1] "H9" "MLC2V"
unique(gset_1@phenoData@data[["day:ch1"]]) # day of differentiation
## [1] "0" "2" "4" "10" "15" "20" "40" "60" "6"
gset_1@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_1@experimentData@other$last_update_date # last update date
## [1] "Aug 08 2023"
unique(gset_1@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
gset_2@experimentData@title # study title
## [1] "Bulk-RNA-sequencing from chamber specific heart tissue and isolated cardiomyocytes"
gset_2@experimentData@abstract # study abstract
## [1] "Healthy hearts from warm cadavers not usable for heart transplants were identified and tissues were isolated from these. From ventricle tissues, cardiomyocytes were further isolated. Bulk-RNA-sequencing was performed."
gset_2@experimentData@other$overall_design # summary of experiment
## [1] "Bulk-RNA-sequencing from 3 different donors (tissue) and 3 other donors (isolated CMs)"
unique(gset_2@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_2@phenoData@data[["chamber:ch1"]]) # cell lines
## [1] "LV" "RV" "LA" "RA"
unique(gset_2@phenoData@data[["patient:ch1"]]) # patient
## [1] "P1" "P2" "P3" "P4" "P5" "P6"
unique(gset_2@phenoData@data[["tissue:ch1"]]) # tissue
## [1] "Ventricle Isolated Cells" "Heart tissue"
gset_2@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_2@experimentData@other$last_update_date # last update date
## [1] "Apr 26 2023"
unique(gset_2@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
sample_info_1 <- gset_1@phenoData@data[ , (ncol(gset_1@phenoData@data)-2):ncol(gset_1@phenoData@data)]
colnames(sample_info_1) <- gsub(":ch1", "", colnames(sample_info_1))
knitr::kable(sample_info_1, format = "pipe", caption = "<b>Table 1:</b> Sample information for Cardiomyocytes Derived from Stem Cells")
| cell_line | day | experiment number | |
|---|---|---|---|
| GSM6170585 | H9 | 0 | h28 |
| GSM6170586 | H9 | 2 | h28 |
| GSM6170587 | H9 | 4 | h28 |
| GSM6170588 | H9 | 10 | h28 |
| GSM6170589 | H9 | 15 | h28 |
| GSM6170590 | H9 | 20 | h28 |
| GSM6170591 | H9 | 40 | h28 |
| GSM6170592 | H9 | 60 | h28 |
| GSM6170593 | H9 | 0 | h30 |
| GSM6170594 | H9 | 2 | h30 |
| GSM6170595 | H9 | 4 | h30 |
| GSM6170596 | H9 | 6 | h30 |
| GSM6170597 | H9 | 10 | h30 |
| GSM6170598 | H9 | 15 | h30 |
| GSM6170599 | H9 | 20 | h30 |
| GSM6170600 | H9 | 40 | h30 |
| GSM6170601 | H9 | 60 | h30 |
| GSM6170602 | H9 | 0 | h32 |
| GSM6170603 | H9 | 2 | h32 |
| GSM6170604 | H9 | 4 | h32 |
| GSM6170605 | H9 | 6 | h32 |
| GSM6170606 | H9 | 15 | h32 |
| GSM6170607 | H9 | 20 | h32 |
| GSM6170608 | H9 | 40 | h32 |
| GSM6170609 | H9 | 60 | h32 |
| GSM6170610 | H9 | 0 | h41 |
| GSM6170611 | H9 | 2 | h41 |
| GSM6170612 | H9 | 4 | h41 |
| GSM6170613 | H9 | 6 | h41 |
| GSM6170614 | H9 | 10 | h41 |
| GSM6170615 | H9 | 15 | h41 |
| GSM6170616 | H9 | 20 | h41 |
| GSM6170617 | H9 | 40 | h41 |
| GSM6170618 | MLC2V | 0 | h18 |
| GSM6170619 | MLC2V | 2 | h18 |
| GSM6170620 | MLC2V | 4 | h18 |
| GSM6170621 | MLC2V | 6 | h18 |
| GSM6170622 | MLC2V | 10 | h18 |
| GSM6170623 | MLC2V | 15 | h18 |
| GSM6170624 | MLC2V | 20 | h18 |
| GSM6170625 | MLC2V | 40 | h18 |
| GSM6170626 | MLC2V | 60 | h18 |
| GSM6170627 | MLC2V | 0 | h42 |
| GSM6170628 | MLC2V | 2 | h42 |
| GSM6170629 | MLC2V | 4 | h42 |
| GSM6170630 | MLC2V | 6 | h42 |
| GSM6170631 | MLC2V | 10 | h42 |
| GSM6170632 | MLC2V | 15 | h42 |
| GSM6170633 | MLC2V | 20 | h42 |
| GSM6170634 | MLC2V | 40 | h42 |
| GSM6170635 | MLC2V | 60 | h42 |
sample_info_2 <- gset_2@phenoData@data[ , (ncol(gset_2@phenoData@data)-2):ncol(gset_2@phenoData@data)]
colnames(sample_info_2) <- gsub(":ch1", "", colnames(sample_info_2))
knitr::kable(sample_info_2, format = "pipe", caption = "<b>Table 2:</b> Sample information for Patient-Derived Samples")
| chamber | patient | tissue | |
|---|---|---|---|
| GSM6199297 | LV | P1 | Ventricle Isolated Cells |
| GSM6199298 | RV | P1 | Ventricle Isolated Cells |
| GSM6199299 | LV | P2 | Ventricle Isolated Cells |
| GSM6199300 | RV | P2 | Ventricle Isolated Cells |
| GSM6199301 | LV | P3 | Ventricle Isolated Cells |
| GSM6199302 | RV | P3 | Ventricle Isolated Cells |
| GSM6199303 | LA | P4 | Heart tissue |
| GSM6199304 | LV | P4 | Heart tissue |
| GSM6199305 | RA | P4 | Heart tissue |
| GSM6199306 | RV | P4 | Heart tissue |
| GSM6199307 | LA | P5 | Heart tissue |
| GSM6199308 | LV | P5 | Heart tissue |
| GSM6199309 | RA | P5 | Heart tissue |
| GSM6199310 | RV | P5 | Heart tissue |
| GSM6199311 | LA | P6 | Heart tissue |
| GSM6199312 | LV | P6 | Heart tissue |
| GSM6199313 | RA | P6 | Heart tissue |
| GSM6199314 | RV | P6 | Heart tissue |
To clean our data, we will do two things: 1. Filter out zero-expressors 2. Perform HGNC re-mapping 3. Normalize the data
From A1, we know that there are genes that minimally express throughout different conditions, and as a result, we will be filtering those out.
# Pick out all rows that have zeroes across all conditions
rows_to_keep <- apply(counts_data, 1, function(row) any(row != 0))
# Create new matrix containing only the rows that aren't all zeroes
counts_data_zero_filtered <- counts_data[rows_to_keep, ]
In line with the experimenters’ analyses, we will normalize the data together
First, let’s perform CPM filtering to remove low expressors.
# Minimum number of samples for the hPSC run
min_num_samples <- 2
counts_data_zero_filtered_matrix <- as.matrix(counts_data_zero_filtered)
# Get rid of low counts
# We add 0.1 for numerical stability and to prevent NA values when evaluating logs
keep = rowSums(log2(cpm(counts_data_zero_filtered_matrix+0.1)) > 1) > min_num_samples
filtered_counts_matrix = counts_data_zero_filtered_matrix[keep,]
After, we will use TMM from edgeR (Robinson, McCarthy, and Smyth 2010) to perform our normalization steps.
# We will make our DGEList with the filtered count matrices
d <- DGEList(counts=filtered_counts_matrix)
# Calculate normalization factors
d <- calcNormFactors(d)
# Apply normalization
normalized_counts <- cpm(d)
Normalization is now done, and we can move onto HGNC mapping.
The protocol followed for HGNC mapping is nearly identical to the first assignment. First, we will add a column to our normalized with the appropriate NCBI gene ID. While we do have the annotation table, it is also true that we were specified to still manually perform this step. We will download a copy of the HGNC dataset. (Tweedie et al. 2020) The dataset is confirmed to be up-to-date with the date of download.
# Download the HGNC file if it doesn't already exist in the cwd
if (!file.exists("hgnc_genes.RData")) {
hgnc_genes <- import_hgnc_dataset(file=latest_archive_url())
saveRDS(hgnc_genes, "hgnc_genes.RData")
} else {
hgnc_genes <- readRDS("hgnc_genes.RData")
}
hgnc_mapping <- hgnc_genes[, c('symbol', 'entrez_id')]
Now that we have a mapping, we can apply this to the table.
# Convert the normalized counts into a dataframe
normalized_counts_df <- as.data.frame(normalized_counts)
# Add rownames
normalized_counts_df$NCBI_gene_id <- rownames(normalized_counts_df)
# Map the entrez_id to the NCBI_gnes
labelled_counts_data <- merge(normalized_counts_df, hgnc_mapping, by.x = "NCBI_gene_id", by.y = 'entrez_id', all.x = TRUE)
labelled_counts_data <- labelled_counts_data[,c(1, ncol(labelled_counts_data), 3:ncol(labelled_counts_data)-1)]
knitr::kable(labelled_counts_data[c(1:10), c(1,2)], format = "pipe", caption = "<b>Table 3:</b> Table Matching First 10 NCBI Gene IDs to the Approved HGNC Symbols")
| NCBI_gene_id | symbol |
|---|---|
| 1 | A1BG |
| 100 | ADA |
| 1000 | CDH2 |
| 10000 | AKT3 |
| 100008587 | RNA5-8SN5 |
| 100008588 | RNA18SN5 |
| 100008589 | RNA28SN5 |
| 100009676 | ZBTB11-AS1 |
| 10001 | MED6 |
| 10003 | NAALAD2 |
labelled_counts_data <- labelled_counts_data[, -c(1)]
Now, we must deal with NA and empty values, along with
many-to-one mappings.
# Remove all NA labels
no_na_labelled_counts_data <- labelled_counts_data[!is.na(labelled_counts_data$symbol), ]
# Remove all empty strings symbols
no_emp_labelled_counts_data <- no_na_labelled_counts_data[no_na_labelled_counts_data$symbol != '', ]
rownames(no_emp_labelled_counts_data) <- no_emp_labelled_counts_data[, 1]
final_normalized_data <- no_emp_labelled_counts_data[, -c(1)]
Now that we have our data, we can look at a few counting statistics to gain insight into what we’re looking at.
# Create a list of overview stats
overview_stats <- list()
# Do it for each sample
for (sample_name in colnames(final_normalized_data)) {
sample_data <- final_normalized_data[, sample_name]
total_counts <- sum(sample_data)
mean_counts <- mean(sample_data)
median_counts <- median(sample_data)
sd_counts <- sd(sample_data)
var_counts <- var(sample_data)
overview_stats[[sample_name]] <- c(total_counts, mean_counts, median_counts, sd_counts, var_counts)
}
overview_stats_df <- do.call(rbind, overview_stats)
rownames(overview_stats_df) <- names(overview_stats)
colnames(overview_stats_df) <- c("Total Counts", "Mean Counts", "Median Counts", "Counts Standard Deviation", "Counts Variation")
knitr::kable(overview_stats_df[, ], format = "pipe", caption = "<b>Table 4:</b> Overview Statistics of Normalized Samples. The total counts are somewhat similar, but the mean counts are lower in the hPSC-derived samples. Mean counts are somewhat similar in both, but the median counts are greater in the hPSCs. The standard deviation and variation in the hPSCs is far lesser than that of the patient samples.")
| Total Counts | Mean Counts | Median Counts | Counts Standard Deviation | Counts Variation | |
|---|---|---|---|---|---|
| Heart 28 Day 0 H9 | 1021771.9 | 59.98426 | 12.24620 | 431.8453 | 186490.34 |
| Heart 28 Day 2 H9 | 786599.4 | 46.17819 | 13.63105 | 154.4348 | 23850.12 |
| Heart 28 Day 4 H9 | 716839.2 | 42.08285 | 13.82826 | 125.4171 | 15729.45 |
| Heart 28 Day 10 H9 | 793482.4 | 46.58227 | 13.73218 | 236.6323 | 55994.84 |
| Heart 28 Day 15 H9 | 745914.9 | 43.78977 | 14.72937 | 195.7736 | 38327.31 |
| Heart 28 Day 20 H9 | 993073.6 | 58.29949 | 14.57508 | 340.7407 | 116104.22 |
| Heart 28 Day 40 H9 | 1023479.0 | 60.08448 | 14.23242 | 342.5525 | 117342.18 |
| Heart 28 Day 60 H9 | 829562.9 | 48.70042 | 14.76538 | 223.1736 | 49806.45 |
| Heart 30 Day 0 H9 | 863916.5 | 50.71719 | 12.75642 | 218.8989 | 47916.72 |
| Heart 30 Day 2 H9 | 771994.0 | 45.32077 | 13.16358 | 169.0652 | 28583.03 |
| Heart 30 Day 4 H9 | 699609.0 | 41.07133 | 13.84685 | 136.2720 | 18570.05 |
| Heart 30 Day 6 H9 | 786225.2 | 46.15623 | 14.09130 | 184.6986 | 34113.57 |
| Heart 30 Day 10 H9 | 761009.1 | 44.67589 | 14.40243 | 189.8644 | 36048.49 |
| Heart 30 Day 15 H9 | 908915.2 | 53.35888 | 13.90010 | 360.0973 | 129670.08 |
| Heart 30 Day 20 H9 | 887759.9 | 52.11694 | 13.69645 | 273.7729 | 74951.58 |
| Heart 30 Day 40 H9 | 1010387.1 | 59.31591 | 13.56070 | 489.7979 | 239901.95 |
| Heart 30 Day 60 H9 | 920475.0 | 54.03751 | 14.47831 | 373.9279 | 139822.08 |
| Heart 32 Day 0 H9 | 779406.7 | 45.75594 | 13.18829 | 149.3232 | 22297.42 |
| Heart 32 Day 2 H9 | 770684.2 | 45.24387 | 13.37835 | 153.5595 | 23580.52 |
| Heart 32 Day 4 H9 | 702719.5 | 41.25393 | 13.64478 | 129.9497 | 16886.93 |
| Heart 32 Day 6 H9 | 772659.8 | 45.35985 | 14.03981 | 170.6316 | 29115.14 |
| Heart 32 Day 15 H9 | 844624.9 | 49.58465 | 13.86116 | 271.0413 | 73463.40 |
| Heart 32 Day 20 H9 | 831683.9 | 48.82493 | 14.49759 | 253.6298 | 64328.10 |
| Heart 32 Day 40 H9 | 805256.4 | 47.27347 | 15.13632 | 206.4093 | 42604.78 |
| Heart 32 Day 60 H9 | 1339814.7 | 78.65532 | 12.68147 | 1102.2826 | 1215026.94 |
| Heart 41 Day 0 H9 | 747252.1 | 43.86827 | 13.25882 | 164.7467 | 27141.46 |
| Heart 41 Day 2 H9 | 771285.9 | 45.27920 | 13.46844 | 181.9923 | 33121.19 |
| Heart 41 Day 4 H9 | 727104.3 | 42.68547 | 13.48812 | 172.0213 | 29591.32 |
| Heart 41 Day 6 H9 | 744472.3 | 43.70508 | 14.23815 | 176.5771 | 31179.46 |
| Heart 41 Day 10 H9 | 800151.2 | 46.97377 | 14.00636 | 311.0775 | 96769.22 |
| Heart 41 Day 15 H9 | 927299.3 | 54.43814 | 13.63959 | 425.1034 | 180712.93 |
| Heart 41 Day 20 H9 | 809279.0 | 47.50963 | 14.63262 | 230.8466 | 53290.15 |
| Heart 41 Day 40 H9 | 826540.3 | 48.52297 | 15.18484 | 211.2894 | 44643.22 |
| Heart 18 Day 0 MLC2V | 802319.7 | 47.10108 | 13.25450 | 158.2561 | 25045.00 |
| Heart 18 Day 2 MLC2V | 764355.6 | 44.87235 | 13.46746 | 169.3843 | 28691.04 |
| Heart 18 Day 4 MLC2V | 707001.2 | 41.50530 | 14.09808 | 150.1960 | 22558.85 |
| Heart 18 Day 6 MLC2V | 769018.6 | 45.14609 | 14.25261 | 169.0281 | 28570.48 |
| Heart 18 Day 10 MLC2V | 791369.4 | 46.45822 | 14.52728 | 219.7194 | 48276.59 |
| Heart 18 Day 15 MLC2V | 791232.2 | 46.45017 | 14.62889 | 233.2285 | 54395.55 |
| Heart 18 Day 20 MLC2V | 899519.9 | 52.80732 | 14.82049 | 244.6640 | 59860.47 |
| Heart 18 Day 40 MLC2V | 935508.8 | 54.92009 | 14.67489 | 361.6569 | 130795.68 |
| Heart 18 Day 60 MLC2V | 933355.7 | 54.79369 | 14.68259 | 421.3648 | 177548.34 |
| Heart 42 Day 0 MLC2V | 816656.5 | 47.94273 | 13.07401 | 196.8750 | 38759.78 |
| Heart 42 Day 2 MLC2V | 789418.7 | 46.34370 | 13.20376 | 183.5629 | 33695.34 |
| Heart 42 Day 4 MLC2V | 730009.3 | 42.85601 | 13.51457 | 162.4742 | 26397.86 |
| Heart 42 Day 6 MLC2V | 779092.2 | 45.73748 | 13.98386 | 184.5081 | 34043.25 |
| Heart 42 Day 10 MLC2V | 860421.6 | 50.51201 | 14.31508 | 232.9254 | 54254.26 |
| Heart 42 Day 15 MLC2V | 775917.9 | 45.55113 | 14.74511 | 236.1324 | 55758.50 |
| Heart 42 Day 20 MLC2V | 849629.4 | 49.87844 | 14.70218 | 252.2778 | 63644.10 |
| Heart 42 Day 40 MLC2V | 931343.7 | 54.67557 | 14.76092 | 407.5923 | 166131.48 |
| Heart 42 Day 60 MLC2V | 972308.9 | 57.08048 | 15.28911 | 496.6643 | 246675.41 |
| Patient 1 Left Ventricle Isolated Cells | 2097613.3 | 123.14273 | 12.71544 | 2054.7924 | 4222171.89 |
| Patient 1 Right Ventricle Isolated cells | 1801820.6 | 105.77789 | 12.64497 | 1652.3177 | 2730153.91 |
| Patient 2 Left Ventricle Isolated Cells | 2448968.4 | 143.76943 | 11.90376 | 2919.5824 | 8523961.52 |
| Patient 2 Right Ventricle Isolated Cells | 2338963.9 | 137.31149 | 11.48868 | 2609.5704 | 6809857.69 |
| Patient 3 Left Ventricle Isolated Cells | 1955785.5 | 114.81657 | 12.28001 | 1785.9598 | 3189652.30 |
| Patient 3 Right Ventricle Isolated Cells | 1841091.5 | 108.08333 | 12.63626 | 1779.0770 | 3165115.09 |
| Patient 4 Left Atria Tissue | 1289436.8 | 75.69783 | 15.15308 | 1052.6384 | 1108047.51 |
| Patient 4 Left Ventricle Tissue | 1993945.7 | 117.05681 | 14.52426 | 2527.4069 | 6387785.88 |
| Patient 4 Right Atria Tissue | 1289380.0 | 75.69449 | 14.74821 | 1068.7044 | 1142129.05 |
| Patient 4 Right Ventricle Tissue | 1729378.7 | 101.52511 | 13.97466 | 1684.3179 | 2836926.68 |
| Patient 5 Left Atria Tissue | 1374196.6 | 80.67374 | 14.72154 | 1212.6323 | 1470477.14 |
| Patient 5 Left Ventricle Tissue | 1610446.1 | 94.54304 | 14.15593 | 1607.2241 | 2583169.31 |
| Patient 5 Right Atria Tissue | 1383940.8 | 81.24579 | 15.08133 | 1388.4255 | 1927725.46 |
| Patient 5 Right Ventricle Tissue | 1649782.1 | 96.85230 | 14.02391 | 1703.4872 | 2901868.80 |
| Patient 6 Left Atria Tissue | 1432786.4 | 84.11333 | 14.46312 | 1383.8852 | 1915138.25 |
| Patient 6 Left Ventricle Tissue | 1393964.4 | 81.83424 | 14.18233 | 1300.1378 | 1690358.21 |
| Patient 6 Right Atria Tissue | 1273212.8 | 74.74538 | 14.75168 | 1055.1550 | 1113352.02 |
| Patient 6 Right Ventricle Tissue | 1460262.8 | 85.72636 | 14.21627 | 1364.0782 | 1860709.32 |
We can see that there is strong clustering of all the patient-derived samples, and there’s also temporal clustering of all the stem cell-derived CMs. This is evidenced by the fact the from left to right, we see a ‘continuous’ clustering of cardiomyocytes from d60 to d0. We see that the patient samples all cluster together quite strongly, but further away in the firsta and second dimensions. This can likely be explained due to higher variance and low median counts in the patient data, but the fact that the cells came from warm cadavers, and the cell niche The range for the MDS Leading logFC dimensions are also quite high, which could be explained due to having two different types of data here.
We will now prefer differential gene expression analyses. Based on the results of the previously generated MDS plots, I think that important analyses would be: * Pairwise comparisons between each time point of the hPSC to cardiomyocyte differentiation + Specifically a comparison between day 0 and day 60 of the stem cell experiments - Comparison of the day 60 * hPSC-cardiomyocyte to the rest of the patient heart cells
We will determine the p-values of each of the experiment using edgeR. (Robinson, McCarthy, and Smyth 2010)
# 1. Prepare the data
# Truncate the sample names
converted_names <- gsub("Heart \\d+ Day (\\d+) .*", "day \\1", colnames(final_normalized_data))
converted_names <- gsub("Patient \\d+ (Left|Right) (Ventricle|Atria) .*", "\\1 \\2", converted_names)
converted_names <- gsub(" ", "", converted_names)
# Create a data frame with sample information
samples <- data.frame(converted_names)
# Convert the count data to a DGEList object
dge <- DGEList(counts = final_normalized_data)
# 2. Create the design matrix
# Create the design matrix with only 'day' as a factor
design <- model.matrix(~0 + factor(samples$converted_names))
# Rename the columns to be more interpretable
colnames(design) <- gsub("factor\\(samples\\$converted_names\\)", "", colnames(design))
dge <- estimateDisp(dge, design)
# 4. Fit the model
fit <- glmQLFit(dge, design)
Now, we will create the contrast matrices for comparisons.
# 5. Create contrast matrices
twoVS0con <- makeContrasts(
day2vsday0 = day2 - day0,
levels = design
)
fourVS2con <- makeContrasts(
day4vsday2 = day4 - day2,
levels = design
)
sixVS4con <- makeContrasts(
day6vsday4 = day6 - day4,
levels = design
)
tenVS6con <- makeContrasts(
day10vsday6 = day10 - day6,
levels = design
)
fifteenVS10con <- makeContrasts(
day15vsday10 = day15 - day10,
levels = design
)
twentyVS15con <- makeContrasts(
day20vsday15 = day20 - day15,
levels = design
)
fortyVS20con <- makeContrasts(
day40vsday20 = day40 - day20,
levels = design
)
sixtyVS40con <- makeContrasts(
day60vsday40 = day60 - day40,
levels = design
)
sixtyVS0con <- makeContrasts(
day60vsday0 = day60 - day0,
levels = design
)
lvVS60 <- makeContrasts(
LVvsday0 = LeftVentricle - day60,
levels = design
)
rvVS60 <- makeContrasts(
RVvsday60 = RightVentricle - day60,
levels = design
)
laVS60 <- makeContrasts(
LAvsday60 = LeftAtria - day60,
levels = design
)
raVS60 <- makeContrasts(
RAvsday60 = RightAtria - day60,
levels = design
)
lvVSrv <- makeContrasts(
lvvsrv = LeftVentricle - RightVentricle,
levels = design
)
Now, we can find differential expression between conditions
# 6. Perform the tests
twoVS0_lrt <- glmQLFTest(fit, contrast = twoVS0con)
fourVS2_lrt <- glmQLFTest(fit, contrast = fourVS2con)
sixVS4_lrt <- glmQLFTest(fit, contrast = sixVS4con)
tenVS6_lrt <- glmQLFTest(fit, contrast = tenVS6con)
fifteenVS10_lrt <- glmQLFTest(fit, contrast = fifteenVS10con)
twentyVS15_lrt <- glmQLFTest(fit, contrast = twentyVS15con)
fortyVS20_lrt <- glmQLFTest(fit, contrast = fortyVS20con)
sixtyVS40_lrt <- glmQLFTest(fit, contrast = sixtyVS40con)
sixtyVS0_lrt <- glmQLFTest(fit, contrast = sixtyVS0con)
lvVS60_lrt <- glmQLFTest(fit, contrast = lvVS60)
rvVS60_lrt <- glmQLFTest(fit, contrast = rvVS60)
laVS60_lrt <- glmQLFTest(fit, contrast = laVS60)
raVS60_lrt <- glmQLFTest(fit, contrast = raVS60)
lvVSrv_lrt <- glmQLFTest(fit, contrast = lvVSrv)
# 7. Extract results
results_twoVS0 <- topTags(twoVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fourVS2 <- topTags(fourVS2_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixVS4 <- topTags(sixVS4_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_tenVS6 <- topTags(tenVS6_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fifteenVS10 <- topTags(fifteenVS10_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_twentyVS15 <- topTags(twentyVS15_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fortyVS20 <- topTags(fortyVS20_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS40 <- topTags(sixtyVS40_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS0 <- topTags(sixtyVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVS60 <- topTags(lvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_rvVS60 <- topTags(rvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_laVS60 <- topTags(laVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_raVS60 <- topTags(raVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVSrv <- topTags(lvVSrv_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
I chose to use a cut-off of 0.01, in-line with the original study, and the limit the amount of genes that show up as significant, as there is a lot of temporal variation in stem cell differentiation protocols. In terms of the multiple hypothesis correction, I chose to use Benjamini-Hochberg for its robustness and that it seems quite standard in such analyses.
We will now present summary data in a table for each comparison.
summary_table <- data.frame(
Comparison = c("Day 2 VS Day 0",
"Day 4 VS Day 2",
"Day 6 VS Day 4",
"Day 10 VS Day 6",
"Day 15 VS Day 10",
"Day 20 VS Day 15",
"Day 40 VS Day 20",
"Day 60 VS Day 40",
"Day 60 VS Day 0",
"LV VS Day 60",
"RV VS Day 60",
"LA VS Day 60",
"RA VS Day 60",
"LV VS RV"
),
Significant = c(length(which(results_twoVS0$table$PValue < 0.01)),
length(which(results_fourVS2$table$PValue < 0.01)),
length(which(results_sixVS4$table$PValue < 0.01)),
length(which(results_tenVS6$table$PValue < 0.01)),
length(which(results_fifteenVS10$table$Value < 0.01)),
length(which(results_twentyVS15$table$PValue < 0.01)),
length(which(results_fortyVS20$table$PValue < 0.01)),
length(which(results_sixtyVS40$table$PValue < 0.01)),
length(which(results_sixtyVS0$table$PValue < 0.01)),
length(which(results_lvVS60$table$PValue < 0.01)),
length(which(results_rvVS60$table$PValue < 0.01)),
length(which(results_laVS60$table$PValue < 0.01)),
length(which(results_raVS60$table$PValue < 0.01)),
length(which(results_lvVSrv$table$PValue < 0.01))
),
Corrected = c(length(which(results_twoVS0$table$FDR < 0.01)),
length(which(results_fourVS2$table$FDR < 0.01)),
length(which(results_sixVS4$table$FDR < 0.01)),
length(which(results_tenVS6$table$FDR < 0.01)),
length(which(results_fifteenVS10$table$FDR < 0.01)),
length(which(results_twentyVS15$table$FDR < 0.01)),
length(which(results_fortyVS20$table$FDR < 0.01)),
length(which(results_sixtyVS40$table$FDR < 0.01)),
length(which(results_sixtyVS0$table$FDR < 0.01)),
length(which(results_lvVS60$table$FDR < 0.01)),
length(which(results_rvVS60$table$FDR < 0.01)),
length(which(results_laVS60$table$FDR < 0.01)),
length(which(results_raVS60$table$FDR < 0.01)),
length(which(results_lvVSrv$table$FDR < 0.01))
)
)
kable(summary_table, caption = "Table 5: The number of genes that show the number of significantly differentially expressed genes before and after FDR correction. For the temporal analysis, the first two comparisons hover around 1800 genes, the next two are at around 900, and then there's a steep drop off at the day 15 vs 10 and day 20 vs 15 comparisons. This picks back up. This picks back up intil days 60 to 40, showing minimal change, suggesting that they're in late stage maturation. Between day 60 and patient samples, many genes, all above 6000, are significantly differentially expressed. Between ventricles, no change is observed. ")
| Comparison | Significant | Corrected |
|---|---|---|
| Day 2 VS Day 0 | 3230 | 1895 |
| Day 4 VS Day 2 | 3006 | 1774 |
| Day 6 VS Day 4 | 2056 | 866 |
| Day 10 VS Day 6 | 2301 | 1006 |
| Day 15 VS Day 10 | 0 | 64 |
| Day 20 VS Day 15 | 1641 | 416 |
| Day 40 VS Day 20 | 2501 | 1163 |
| Day 60 VS Day 40 | 464 | 15 |
| Day 60 VS Day 0 | 9281 | 8768 |
| LV VS Day 60 | 10698 | 10290 |
| RV VS Day 60 | 10423 | 9999 |
| LA VS Day 60 | 7281 | 6360 |
| RA VS Day 60 | 7082 | 6126 |
| LV VS RV | 17 | 0 |
As expected, the comparison with the greatest number of differentially expressed genes was between day 60 and day 0, which compares an hPSC to a left ventricular cardiomyocyte derived from stem cells. Interestingly, we see a very small amount of differentially expressed genes for day 15 vs day 10 and day 60 and day 40. In terms of the patient sample comparisons, we see that all of them have a great number of genes with a significant values of differential expression, which indicates that even the mature hPSC-derived CM is very far from the patient by transcriptome. This could be explained due to the niche of the cell–cardiomyocytes in vivo are constantly beating, working to pump blood, and are also surrounded by other tissue types which secrete signals to regulate expression of different transcripts. In a dish, this isn’t the case, which could explain this phenomenon. We can investigate this more closely once we see which genes come out as top hits.
We will save our results in a list for ease of downstream analysis.
# List of all the results objects
results_list <- list(
results_twoVS0 = results_twoVS0,
results_fourVS2 = results_fourVS2,
results_sixVS4 = results_sixVS4,
results_tenVS6 = results_tenVS6,
results_fifteenVS10 = results_fifteenVS10,
results_twentyVS15 = results_twentyVS15,
results_fortyVS20 = results_fortyVS20,
results_sixtyVS40 = results_sixtyVS40,
results_sixtyVS0 = results_sixtyVS0,
results_lvVS60 = results_lvVS60,
results_rvVS60 = results_rvVS60,
results_laVS60 = results_laVS60,
results_raVS60 = results_raVS60,
results_lvVSrv = results_lvVSrv
)
# Titles for each plot
titles <- c(
"Day 2 vs Day 0",
"Day 4 vs Day 2",
"Day 6 vs Day 4",
"Day 10 vs Day 6",
"Day 15 vs Day 10",
"Day 20 vs Day 15",
"Day 40 vs Day 20",
"Day 60 vs Day 40",
"Day 60 vs Day 0",
"LV vs Day 60",
"RV vs Day 60",
"LA vs Day 60",
"RA vs Day 60",
"LV vs RV"
)
We will now filter with the same parameters used to filter in the volcano plots.
# Filter each comparison based on the specified criteria
filtered_results_list <- lapply(results_list, function(result) {
result$table %>%
filter(FDR < 0.01, abs(logFC) > 2)
})
# Names for the filtered results
names(filtered_results_list) <- names(results_list)
filtered_results_display_df <- data.frame(
Num_Genes = sapply(filtered_results_list, nrow)
)
kable(filtered_results_display_df, caption = "Table 6: Number of genes meeting criteria for each comparison. Notice that because of the logFC conition, all values are necessarily equal to or smaller than the values in Table 5.")
| Num_Genes | |
|---|---|
| results_twoVS0 | 563 |
| results_fourVS2 | 616 |
| results_sixVS4 | 368 |
| results_tenVS6 | 251 |
| results_fifteenVS10 | 20 |
| results_twentyVS15 | 132 |
| results_fortyVS20 | 323 |
| results_sixtyVS40 | 9 |
| results_sixtyVS0 | 3720 |
| results_lvVS60 | 3836 |
| results_rvVS60 | 3762 |
| results_laVS60 | 2369 |
| results_raVS60 | 2289 |
| results_lvVSrv | 0 |
# Extract gene names from each filtered result and combine them into a single vector
union_gene_names <- unlist(lapply(filtered_results_list, function(filtered_result) {
rownames(filtered_result)
}))
# Remove duplicate gene names
all_genes <- unique(union_gene_names)
We will also need single genes for downstream analysis.
# Get only the unique genes for downstream analysis
gene_counts <- table(union_gene_names)
unique_genes <- names(gene_counts[gene_counts == 1])
Here, we will define up and down-regulated gene sets. The goal of this function block was to create a general overall list of genes that were up and down-regulated for each condition, and to create a pooled one as well. In terms of significance values, we’ll use a cut-off of FDR < 0.05 and the magnitude of logFC being 0. These are the same parameters the authors used in their analysis.
ora_filtered_results_list <- lapply(results_list, function(result) {
result$table %>%
filter(FDR < 0.05)
})
up_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
result %>%
filter(logFC > 0)
})
up_regulated_results_list <- lapply(up_regulated_results_list, function(result) {
rownames(result)
})
down_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
result %>%
filter(logFC < 0)
})
down_regulated_results_list <- lapply(down_regulated_results_list, function(result) {
rownames(result)
})
all_up_regulated_genes <- unique(unlist(up_regulated_results_list))
all_down_regulated_genes <- unique(unlist(down_regulated_results_list))
Let’s set up the g:Profiler for all of the aforementioned combinations (per condition, for significant, up and down-regulated genes). (Reimand et al. 2007) We will be using the g:Profiler since it’s a popular and robust tool for ORA, and it was also a technique learned in a previous homework assignment. The paper only uses the GO BP annotation dataset, and we will be doing the same. This dataset …
go_upreg <- gost(
query = all_up_regulated_genes,
organism = "hsapiens",
sources = c("GO:BP")
)
go_downreg <- gost(
query = all_down_regulated_genes,
organism = "hsapiens",
sources = c("GO:BP")
)
go_updownreg <- gost(
query = list(Downregulated = all_down_regulated_genes, Upregulated = all_up_regulated_genes),
organism = "hsapiens",
sources = c("GO:BP")
)
We can see that running the analysis together or separately still yields the same number of total pathways being recognized.
Let’s look at a few up-regulated terms.
We will filter pathways with greater than 1000 terms, since they could be ‘ubiquitous’ and yield false positives. Start with up-regulation.
filtered_go_upreg <- go_updownreg$result[
which(go_updownreg$result$query == "Upregulated" & go_updownreg$result$term_size <= 1000 ),]
And now down-regulation.
filtered_go_downreg <- go_updownreg$result[
which(go_updownreg$result$query == "Downregulated" & go_updownreg$result$term_size <= 1000 ),]
We will extract up and down-regualted genes using the same conditions for all conditions.
# Initialize lists to hold the gene sets
upregulated_genesets <- list()
downregulated_genesets <- list()
# Iterate over the filtered results list
for (condition_name in names(ora_filtered_results_list)) {
# Extract the data frame for the condition
condition_df <- ora_filtered_results_list[[condition_name]]
# Filter for upregulated genes (FDR < 0.05 and logFC > 0)
upregulated_genes <- condition_df %>%
filter(FDR < 0.05, logFC > 0) %>%
rownames()
# Filter for downregulated genes (FDR < 0.05 and logFC < 0)
downregulated_genes <- condition_df %>%
filter(FDR < 0.05, logFC < 0) %>%
rownames()
# Add to the lists
upregulated_genesets[[condition_name]] <- upregulated_genes
downregulated_genesets[[condition_name]] <- downregulated_genes
}
First, we’ll look at the temporal comparisons.
# Initialize an empty list to hold the GO analysis for each comparison
days_analysis_list <- vector("list", 9)
# Iterate over each comparison
for (i in 1:9) {
cond <- names(ora_filtered_results_list)[i]
# Perform g:Profiler analysis for upregulated and downregulated genes separately
go_analysis <- gost(
query = list(Downregulated = downregulated_genesets[[cond]],
Upregulated = upregulated_genesets[[cond]]),
organism = "hsapiens",
sources = c("GO:BP")
)
# Add the results to the list
days_analysis_list[[cond]] <- go_analysis
}
# Remove the first 9 empty entries
days_analysis_list <- days_analysis_list[-(1:9)]
And now the patient ones.
# Initialize an empty list to hold the GO analysis for each comparison
patient_analysis_list <- vector("list", 4)
# Iterate over each comparison
for (i in 10:13) {
cond <- names(ora_filtered_results_list)[i]
# Perform g:Profiler analysis for upregulated and downregulated genes separately
go_analysis <- gost(
query = list(Downregulated = downregulated_genesets[[cond]],
Upregulated = upregulated_genesets[[cond]]),
organism = "hsapiens",
sources = c("GO:BP")
)
# Add the results to the list
patient_analysis_list[[cond]] <- go_analysis
}
# Remove the first 4 empty entries
patient_analysis_list <- patient_analysis_list[-(1:4)]
We will now display both.
# Initialize an empty data frame to hold the top pathways for each condition and direction
top_pathways_df <- data.frame(matrix(ncol = length(names(days_analysis_list)) * 2, nrow = 10))
# Set the column names for the data frame
colnames(top_pathways_df) <- paste(rep(names(days_analysis_list), each = 2), rep(c("Upregulated", "Downregulated"), times = length(names(days_analysis_list))), sep = "_")
# Fill the data frame with the top pathways
for (cond in names(days_analysis_list)) {
# Extract the top 10 upregulated pathways
top_upreg <- days_analysis_list[[cond]]$result %>%
filter(query == "Upregulated", term_size <= 1000) %>%
arrange(p_value) %>%
slice(1:10) %>%
.$term_name
# Ensure that we have exactly 10 entries by padding with NA if necessary
top_upreg <- top_upreg[1:10]
if (length(top_upreg) < 10) {
top_upreg <- c(top_upreg, rep(NA, 10 - length(top_upreg)))
}
# Extract the top 10 downregulated pathways
top_downreg <- days_analysis_list[[cond]]$result %>%
filter(query == "Downregulated", term_size <= 1000) %>%
arrange(p_value) %>%
slice(1:10) %>%
.$term_name
# Ensure that we have exactly 10 entries by padding with NA if necessary
top_downreg <- top_downreg[1:10]
if (length(top_downreg) < 10) {
top_downreg <- c(top_downreg, rep(NA, 10 - length(top_downreg)))
}
# Add the pathways to the data frame
top_pathways_df[paste(cond, "Upregulated", sep = "_")] <- top_upreg
top_pathways_df[paste(cond, "Downregulated", sep = "_")] <- top_downreg
}
# Create the kable
kable(top_pathways_df, caption = "Table 7: Top 10 upregulated and downregulated pathways for each temporal comparison. Intepretation of these results are in the Interpretation section.", align = 'c')
| results_twoVS0_Upregulated | results_twoVS0_Downregulated | results_fourVS2_Upregulated | results_fourVS2_Downregulated | results_sixVS4_Upregulated | results_sixVS4_Downregulated | results_tenVS6_Upregulated | results_tenVS6_Downregulated | results_fifteenVS10_Upregulated | results_fifteenVS10_Downregulated | results_twentyVS15_Upregulated | results_twentyVS15_Downregulated | results_fortyVS20_Upregulated | results_fortyVS20_Downregulated | results_sixtyVS40_Upregulated | results_sixtyVS40_Downregulated | results_sixtyVS0_Upregulated | results_sixtyVS0_Downregulated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| regulation of anatomical structure morphogenesis | neuron projection morphogenesis | heart development | ribosome biogenesis | heart development | DNA replication | muscle system process | chromosome organization | muscle system process | cellular response to growth factor stimulus | actin cytoskeleton organization | cytoplasmic translation | external encapsulating structure organization | mitotic cell cycle | NA | fructose metabolic process | muscle structure development | ribosome biogenesis |
| embryonic morphogenesis | plasma membrane bounded cell projection morphogenesis | muscle structure development | ncRNA metabolic process | muscle structure development | DNA-templated DNA replication | muscle contraction | mitotic cell cycle | striated muscle contraction | response to growth factor | actin filament-based process | peptide metabolic process | extracellular matrix organization | mitotic cell cycle process | NA | carbohydrate phosphorylation | heart development | ncRNA metabolic process |
| tissue morphogenesis | cell projection morphogenesis | tissue morphogenesis | rRNA metabolic process | muscle system process | DNA repair | striated muscle contraction | mitotic cell cycle process | actin filament-based movement | regulation of cell migration | supramolecular fiber organization | translation | extracellular structure organization | chromosome organization | NA | fructose 2,6-bisphosphate metabolic process | tube morphogenesis | ncRNA processing |
| heart development | cellular anatomical entity morphogenesis | heart morphogenesis | rRNA processing | muscle tissue development | DNA damage response | oxoacid metabolic process | chromosome segregation | cardiac muscle contraction | NA | cell junction organization | peptide biosynthetic process | cellular response to growth factor stimulus | chromosome segregation | NA | monosaccharide metabolic process | vasculature development | translation |
| head development | cell morphogenesis | mesenchyme development | ncRNA processing | striated muscle tissue development | chromatin organization | organic acid metabolic process | nuclear chromosome segregation | regulation of muscle system process | NA | apoptotic signaling pathway | amide biosynthetic process | circulatory system process | mitotic nuclear division | NA | NA | blood vessel development | rRNA metabolic process |
| gastrulation | cell morphogenesis involved in neuron differentiation | enzyme-linked receptor protein signaling pathway | ribosomal small subunit biogenesis | cardiac muscle tissue development | protein-DNA complex organization | carboxylic acid metabolic process | cell division | actin-mediated cell contraction | NA | enzyme-linked receptor protein signaling pathway | transmembrane receptor protein serine/threonine kinase signaling pathway | response to growth factor | organelle fission | NA | NA | muscle tissue development | amide biosynthetic process |
| morphogenesis of an epithelium | chemical synaptic transmission | tube morphogenesis | ribosomal large subunit biogenesis | muscle contraction | chromosome organization | generation of precursor metabolites and energy | nuclear division | muscle contraction | NA | cell morphogenesis | transforming growth factor beta receptor superfamily signaling pathway | regulation of system process | nuclear division | NA | NA | muscle cell differentiation | peptide biosynthetic process |
| tube morphogenesis | anterograde trans-synaptic signaling | mesenchymal cell differentiation | maturation of SSU-rRNA | actin filament-based process | double-strand break repair | monocarboxylic acid metabolic process | sister chromatid segregation | cardiac muscle cell contraction | NA | ameboidal-type cell migration | enzyme-linked receptor protein signaling pathway | enzyme-linked receptor protein signaling pathway | mitotic sister chromatid segregation | NA | NA | striated muscle tissue development | rRNA processing |
| embryo development ending in birth or egg hatching | trans-synaptic signaling | embryonic morphogenesis | peptide metabolic process | muscle cell differentiation | chromatin remodeling | heart process | regulation of cell cycle process | regulation of system process | NA | cell adhesion mediated by integrin | protein-lipid complex remodeling | camera-type eye development | sister chromatid segregation | NA | NA | cardiac muscle tissue development | DNA damage response |
| chordate embryonic development | regulation of cell projection organization | cellular anatomical entity morphogenesis | peptide biosynthetic process | striated muscle cell differentiation | negative regulation of transcription by RNA polymerase II | cardiac muscle contraction | DNA replication | regulation of muscle adaptation | NA | response to wounding | plasma lipoprotein particle remodeling | sensory system development | nuclear chromosome segregation | NA | NA | blood vessel morphogenesis | chromosome organization |
First, we will be excluding the patient samples from this analysis. They proved to be very problematic with fgsea.
# Set all the conditions involving patients to null
results_list$results_lvVS60 <- NULL
results_list$results_rvVS60 <- NULL
results_list$results_laVS60 <- NULL
results_list$results_raVS60 <- NULL
results_list$results_lvVSrv <- NULL
Now we can generate our ranks, and we’ll save them to a directory called ranks such that we can easily access them downstream during our Cytoscape analysis.
rank_and_write <- function(results_list, dir_name) {
# Create the directory if it doesn't exist
if (!dir.exists(dir_name)) {
dir.create(dir_name)
}
# Iterate over each comparison in the results_list
for (comparison in names(results_list)) {
# Extract the data and check for any missing values in necessary columns
results_table <- results_list[[comparison]]$table
if (any(is.na(results_table$logFC)) || any(is.na(results_table$FDR))) {
stop("Missing values found in logFC or FDR columns.")
}
# Calculate ranks
ranks <- -1 * sign(results_table$logFC) * log10(results_table$FDR)
# Add a very small random noise to the rank to break ties
set.seed(123) # Setting a seed for reproducibility
noise <- runif(length(ranks), min=-1e-6, max=1e-6)
ranks <- ranks + noise
# Prepare data frame for writing
ordered_results <- data.frame(
GeneName = rownames(results_table), # Use row names as GeneName
rank = ranks
)
# Remove duplicate gene names if necessary
ordered_results <- ordered_results[!duplicated(ordered_results$GeneName), ]
# Order by rank and keep only the necessary columns
ordered_results <- ordered_results[order(ordered_results$rank, decreasing = TRUE), ]
# Write to .rnk file
file_name <- paste0(comparison, "_ranks.rnk")
write.table(ordered_results, quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE,
file = file.path(dir_name, file_name))
}
}
# Example usage
rank_and_write(results_list, "ranks")
Now that we’ve calculated our ranks, we can move on to performing fgsea.
First, we’ll select a geneset for the fgsea analysis. We will be using the Bader Lab(bader_lab_2024?) geneset since this was recommended by the course instructor. Additionally, when we do Cytoscape downstream, we’ll also notice that we’ll be using the Bader Lab’s genesets again, so this works to streamline our results.
We’ll fetch the geneset first.
# Define the URL and the local filename
gmt_url <- "https://download.baderlab.org/EM_Genesets/April_01_2024/Human/symbol/Human_GOBP_AllPathways_noPFOCR_no_GO_iea_April_01_2024_symbol.gmt"
gmt_filename <- "Human_GOBP_AllPathways_noPFOCR_no_GO_iea_April_01_2024_symbol.gmt"
# Check if the file already exists
if (!file.exists(gmt_filename)) {
# Download the file
download.file(gmt_url, destfile = gmt_filename)
}
# Load the gene sets
gene_sets <- fgsea::gmtPathways(gmt_filename)
fgsea (Korotkevich et al. 2016) is a lightweight implementation of gsea that runs directly in R. I chose to use this since I was having trouble using the recommended GSEA analysis pipeline discussed in class. Additionally, the instructor had also remarked that she’d used fgsea as well, and results were comparable to the base implementation. We are using the newest version to date, 3.18. Therefore, I chose to use this platform.
Now, let’s define a function to run fgsea for each condition that we have in the results_list.
run_fgsea_for_all_conditions <- function(ranks_dir, gene_sets, useSimple = FALSE) {
nPermSimple = ifelse(useSimple, 100000, NA) # This value is finnicky, adjust according to analysis
# Get the rank files
rnk_files <- list.files(ranks_dir, pattern = "\\.rnk$", full.names = TRUE)
gsea_list <- list()
for (rnk_file in rnk_files) {
condition <- gsub("_ranks\\.rnk$", "", basename(rnk_file))
# Skip atria and ventricle related results, if they haven't already been properly filtered out
if (grepl("LV|RV|LA|RA", condition)) {
cat("Skipping: ", condition, "\n")
next
}
# This is added for a progress check since it takes long for fgsea to run, especially considering the high nPermSimple value
print(paste("Working on:", condition))
ranks <- read.table(rnk_file, header = TRUE, stringsAsFactors = FALSE)
ranks_vector <- setNames(ranks$rank, ranks$GeneName)
# Run fgsea
fgsea_results <- if (useSimple) {
fgsea::fgsea(pathways = gene_sets, stats = ranks_vector, minSize = 15, maxSize = 500, nPermSimple = nPermSimple)
} else {
fgsea::fgseaMultilevel(pathways = gene_sets, stats = ranks_vector, minSize = 15, maxSize = 500)
}
# Save our results to an gsea_list
gsea_list[[condition]] <- fgsea_results
}
return(gsea_list)
}
Now, let’s get our list of fgsea values.
gsea_list <- run_fgsea_for_all_conditions("ranks", gene_sets, useSimple=TRUE)
## [1] "Working on: results_fifteenVS10"
## [1] "Working on: results_fortyVS20"
## [1] "Working on: results_fourVS2"
## [1] "Working on: results_sixtyVS0"
## [1] "Working on: results_sixtyVS40"
## [1] "Working on: results_sixVS4"
## [1] "Working on: results_tenVS6"
## [1] "Working on: results_twentyVS15"
## [1] "Working on: results_twoVS0"
Just as a double check to remove patient entries:
# Remove specific entries
gsea_list$results_lvVS60 <- NULL
gsea_list$results_rvVS60 <- NULL
gsea_list$results_lvVSrv <- NULL
gsea_list$results_laVS60 <- NULL
gsea_list$results_raVS60 <- NULL
fgsea formats its output differently from the GSEA technique discussed in lecture. To remedy this, instructor, Ruth Isserlin, provided a function on Quercus to perform this conversion. I’ve modified it quite a bit so that we can use it to display table data.
format_fgsea_results <- function(gsea_list, ranks_dir) {
formatted_gsea_list <- list()
# Load all ranks as a list of named vectors
rank_files <- list.files(ranks_dir, pattern = "\\.rnk$", full.names = TRUE)
ranks_list <- lapply(rank_files, function(file) {
ranks <- read.table(file, header = TRUE, stringsAsFactors = FALSE)
setNames(ranks$rank, ranks$GeneName)
})
names(ranks_list) <- gsub("_ranks\\.rnk$", "", basename(rank_files))
for (analysis in names(gsea_list)) {
fgsea_results <- gsea_list[[analysis]]
ranks_vector <- ranks_list[[analysis]] # Retrieve the corresponding rank vector
if (length(fgsea_results) > 0 && !is.null(ranks_vector)) {
calculated_rank_at_max <- sapply(fgsea_results$leadingEdge, function(leading_edge) {
valid_genes = names(ranks_vector)[names(ranks_vector) %in% leading_edge]
if (length(valid_genes) > 0) {
return(max(ranks_vector[valid_genes]))
} else {
return(NA) # Return NA if no valid genes
}
})
gsea_results <- data.frame(
name = fgsea_results$pathway,
SIZE = fgsea_results$size,
ES = fgsea_results$ES,
NES = fgsea_results$NES,
pval = fgsea_results$pval,
padj = fgsea_results$padj,
RankAtMax = calculated_rank_at_max,
leadingEdgeGenes = sapply(fgsea_results$leadingEdge, paste, collapse = ",")
)
formatted_gsea_list[[analysis]] <- gsea_results
} else {
warning(paste("No valid fgsea results or ranks found for analysis:", analysis))
}
}
return(formatted_gsea_list)
}
Now, we’ll format our results.
formatted_gsea_list <- format_fgsea_results(gsea_list, "ranks")
Now, let’s look at the results we got from our fgsea analysis.
Let’s break our results down into up and down-regulated analyses
formatted_gsea_df <- lapply(formatted_gsea_list, as.data.frame)
We’ll define a function to filter our pathways for significant up and down-regulated pathways and return both.
# Get all of the pathways with a positive normalization enrichment score that are significant
filter_pathways <- function(df) {
# Filter for enrichment score in either direction and a adjusted p-value of 0.05
upregulated_pathways <- df[df$NES > 0 & df$padj < 0.05, ]
upregulated_pathways <- upregulated_pathways[order(upregulated_pathways$padj, abs(upregulated_pathways$NES), decreasing = c(FALSE, TRUE)),]
downregulated_pathways <- df[df$NES < 0 & df$padj < 0.05, ]
downregulated_pathways <- downregulated_pathways[order(downregulated_pathways$padj, abs(downregulated_pathways$NES), decreasing = c(FALSE, TRUE)),]
# Return a list
list(
upregulated = upregulated_pathways,
downregulated = downregulated_pathways
)
}
filtered_pathways <- lapply(formatted_gsea_df, filter_pathways)
Now that we’ve filtered our pathways, we can now display our top 10 up and down-regulated pathways such that we can compare this to our GO analysis.
# Initialize lists for upregulated and downregulated pathways
upregulated_pathways_list <- list()
downregulated_pathways_list <- list()
# Populate the lists with corresponding entries from filtered_pathways
for(condition in names(filtered_pathways)) {
upregulated_pathways_list[[condition]] <- filtered_pathways[[condition]][["upregulated"]]
downregulated_pathways_list[[condition]] <- filtered_pathways[[condition]][["downregulated"]]
}
We can define a function to get the top 10 pathways for each condition. 10 pathways each that are most significant gives us a pretty good idea as to what’s changing pathway-wise between these conditions.
# Function to get the top 10 pathway names
get_top_pathways <- function(pathways_df) {
if(nrow(pathways_df) >= 10) {
return(pathways_df$name[1:10])
} else {
# If there are less than 10, return as many as there are and pad with NAs
return(c(pathways_df$name, rep(NA, 10 - nrow(pathways_df))))
}
}
# Applying the function to each list
top_upregulated_pathways <- lapply(upregulated_pathways_list, get_top_pathways)
top_downregulated_pathways <- lapply(downregulated_pathways_list, get_top_pathways)
Let’s do up first.
upregulated_pathways_df <- data.frame(top_upregulated_pathways, stringsAsFactors = FALSE)
upregulated_pathways_df <- upregulated_pathways_df[c("results_twoVS0", "results_fourVS2", "results_sixVS4", "results_tenVS6", "results_fifteenVS10", "results_twentyVS15", "results_fortyVS20", "results_sixtyVS40", "results_sixtyVS0")]
# Generate the table with kable
kable(upregulated_pathways_df, col.names = names(upregulated_pathways_df), caption = "Table 8: Top ranked pathways that are upregulated in pairwise timepoints from GSEA analysis.", align = 'c')
| results_twoVS0 | results_fourVS2 | results_sixVS4 | results_tenVS6 | results_fifteenVS10 | results_twentyVS15 | results_fortyVS20 | results_sixtyVS40 | results_sixtyVS0 |
|---|---|---|---|---|---|---|---|---|
| ANIMAL ORGAN MORPHOGENESIS%GOBP%GO:0009887 | STRIATED MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0014706 | MUSCLE STRUCTURE DEVELOPMENT%GOBP%GO:0061061 | GENERATION OF PRECURSOR METABOLITES AND ENERGY%GOBP%GO:0006091 | CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 | HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION%MSIGDBHALLMARK%HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION | PROTEIN SYNTHESIS: ASPARTIC ACID%SMPDB%SMP0111858 | POTASSIUM ION IMPORT ACROSS PLASMA MEMBRANE%GOBP%GO:1990573 | CIRCULATORY SYSTEM DEVELOPMENT%GOBP%GO:0072359 |
| CIRCULATORY SYSTEM DEVELOPMENT%GOBP%GO:0072359 | CARDIAC MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0048738 | MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 | AEROBIC RESPIRATION AND RESPIRATORY ELECTRON TRANSPORT%REACTOME%R-HSA-1428517.4 | HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS | CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 | PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 | INORGANIC ION TRANSMEMBRANE TRANSPORT%GOBP%GO:0098660 | MUSCLE STRUCTURE DEVELOPMENT%GOBP%GO:0061061 |
| HEART DEVELOPMENT%GOBP%GO:0007507 | MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0060537 | REGULATION OF SYSTEM PROCESS%GOBP%GO:0044057 | ENERGY DERIVATION BY OXIDATION OF ORGANIC COMPOUNDS%GOBP%GO:0015980 | CHROMOSOME ORGANIZATION%GOBP%GO:0051276 | MITOTIC METAPHASE AND ANAPHASE%REACTOME DATABASE ID RELEASE 81%2555396 | PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 | NEGATIVE REGULATION OF PROTEIN SECRETION%GOBP%GO:0050709 | HEART DEVELOPMENT%GOBP%GO:0007507 |
| ANATOMICAL STRUCTURE FORMATION INVOLVED IN MORPHOGENESIS%GOBP%GO:0048646 | CARDIAC CHAMBER DEVELOPMENT%GOBP%GO:0003205 | MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0060537 | MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 | HDR THROUGH HOMOLOGOUS RECOMBINATION (HRR)%REACTOME DATABASE ID RELEASE 81%5685942 | MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 81%68882 | PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 | MONOATOMIC ION TRANSMEMBRANE TRANSPORT%GOBP%GO:0034220 | EXTRACELLULAR MATRIX ORGANIZATION%REACTOME%R-HSA-1474244.4 |
| EMBRYO DEVELOPMENT%GOBP%GO:0009790 | CARDIAC SEPTUM DEVELOPMENT%GOBP%GO:0003279 | MUSCLE CONTRACTION%GOBP%GO:0006936 | HALLMARK_MYOGENESIS%MSIGDBHALLMARK%HALLMARK_MYOGENESIS | NUCLEAR CHROMOSOME SEGREGATION%GOBP%GO:0098813 | CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 | PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 | HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDBHALLMARK%HALLMARK_INTERFERON_ALPHA_RESPONSE | MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0060537 |
| EMBRYONIC MORPHOGENESIS%GOBP%GO:0048598 | CELLULAR COMPONENT ASSEMBLY INVOLVED IN MORPHOGENESIS%GOBP%GO:0010927 | MUSCLE CELL DIFFERENTIATION%GOBP%GO:0042692 | PURINE NUCLEOTIDE METABOLIC PROCESS%GOBP%GO:0006163 | CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 | HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS | PROTEIN SYNTHESIS: SERINE%PATHWHIZ%PW120517 | NA | REGULATION OF SYSTEM PROCESS%GOBP%GO:0044057 |
| TUBE DEVELOPMENT%GOBP%GO:0035295 | CARDIAC CELL DEVELOPMENT%GOBP%GO:0055006 | ACTIN CYTOSKELETON ORGANIZATION%GOBP%GO:0030036 | PURINE-CONTAINING COMPOUND METABOLIC PROCESS%GOBP%GO:0072521 | MITOTIC CELL CYCLE PROCESS%GOBP%GO:1903047 | SEPARATION OF SISTER CHROMATIDS%REACTOME DATABASE ID RELEASE 81%2467813 | PROTEIN SYNTHESIS: TRYPTOPHAN%PATHWHIZ%PW120526 | NA | MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 |
| TISSUE MORPHOGENESIS%GOBP%GO:0048729 | OUTFLOW TRACT MORPHOGENESIS%GOBP%GO:0003151 | CELLULAR COMPONENT ASSEMBLY INVOLVED IN MORPHOGENESIS%GOBP%GO:0010927 | PURINE RIBONUCLEOTIDE METABOLIC PROCESS%GOBP%GO:0009150 | REGULATION OF BLOOD CIRCULATION%GOBP%GO:1903522 | SYNTHESIS OF DNA%REACTOME DATABASE ID RELEASE 81%69239 | PROTEIN SYNTHESIS: ISOLEUCINE%SMPDB%SMP0111872 | NA | ANIMAL ORGAN MORPHOGENESIS%GOBP%GO:0009887 |
| EPITHELIUM DEVELOPMENT%GOBP%GO:0060429 | CARDIAC MUSCLE CELL DEVELOPMENT%GOBP%GO:0055013 | STRIATED MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0014706 | AEROBIC RESPIRATION%GOBP%GO:0009060 | S PHASE%REACTOME DATABASE ID RELEASE 81%69242 | RHO GTPASE CYCLE%REACTOME DATABASE ID RELEASE 81%9012999 | PROTEIN SYNTHESIS: LEUCINE%SMPDB%SMP0111873 | NA | MUSCLE CONTRACTION%GOBP%GO:0006936 |
| REGIONALIZATION%GOBP%GO:0003002 | CARDIAC CHAMBER MORPHOGENESIS%GOBP%GO:0003206 | HALLMARK_MYOGENESIS%MSIGDBHALLMARK%HALLMARK_MYOGENESIS | HALLMARK_OXIDATIVE_PHOSPHORYLATION%MSIGDBHALLMARK%HALLMARK_OXIDATIVE_PHOSPHORYLATION | MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 | TNFR2 NON-CANONICAL NF-KB PATHWAY%REACTOME%R-HSA-5668541.5 | PROTEIN SYNTHESIS: ASPARAGINE%SMPDB%SMP0111854 | NA | MUSCLE CELL DIFFERENTIATION%GOBP%GO:0042692 |
And now down.
downregulated_pathways_df <- data.frame(top_downregulated_pathways, stringsAsFactors = FALSE)
downregulated_pathways_df <- downregulated_pathways_df[c("results_twoVS0", "results_fourVS2", "results_sixVS4", "results_tenVS6", "results_fifteenVS10", "results_twentyVS15", "results_fortyVS20", "results_sixtyVS40", "results_sixtyVS0")]
# Generate the table with kable
kable(downregulated_pathways_df, col.names = names(downregulated_pathways_df), caption = "Table 9: Top ranked pathways that are downregulated in pairwise timepoints from GSEA analysis.", align = 'c')
| results_twoVS0 | results_fourVS2 | results_sixVS4 | results_tenVS6 | results_fifteenVS10 | results_twentyVS15 | results_fortyVS20 | results_sixtyVS40 | results_sixtyVS0 |
|---|---|---|---|---|---|---|---|---|
| HALLMARK_MTORC1_SIGNALING%MSIGDBHALLMARK%HALLMARK_MTORC1_SIGNALING | RIBOSOME BIOGENESIS%GOBP%GO:0042254 | DNA REPAIR%GOBP%GO:0006281 | CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 | NCRNA METABOLIC PROCESS%GOBP%GO:0034660 | PROTEIN SYNTHESIS: ASPARTIC ACID%SMPDB%SMP0111858 | CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 | CYTOPLASMIC TRANSLATION%GOBP%GO:0002181 | RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 |
| MODULATION OF CHEMICAL SYNAPTIC TRANSMISSION%GOBP%GO:0050804 | RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 | PROCESSING OF CAPPED INTRON-CONTAINING PRE-MRNA%REACTOME DATABASE ID RELEASE 81%72203 | MITOTIC CELL CYCLE PROCESS%GOBP%GO:1903047 | NCRNA PROCESSING%GOBP%GO:0034470 | PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 | M PHASE%REACTOME%R-HSA-68886.5 | PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 | CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 |
| REGULATION OF TRANS-SYNAPTIC SIGNALING%GOBP%GO:0099177 | NCRNA METABOLIC PROCESS%GOBP%GO:0034660 | HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS | CHROMOSOME ORGANIZATION%GOBP%GO:0051276 | TRANSLATION%GOBP%GO:0006412 | PROTEIN SYNTHESIS: LEUCINE%SMPDB%SMP0111873 | MITOTIC CELL CYCLE PROCESS%GOBP%GO:1903047 | PROTEIN SYNTHESIS: VALINE%PATHWHIZ%PW120528 | NCRNA METABOLIC PROCESS%GOBP%GO:0034660 |
| CHOLESTEROL BIOSYNTHESIS%REACTOME DATABASE ID RELEASE 81%191273 | NCRNA PROCESSING%GOBP%GO:0034470 | CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 | M PHASE%REACTOME%R-HSA-68886.5 | POST-TRANSCRIPTIONAL REGULATION OF GENE EXPRESSION%GOBP%GO:0010608 | PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 | CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 | PROTEIN SYNTHESIS: LYSINE%SMPDB%SMP0111874 | PROCESSING OF CAPPED INTRON-CONTAINING PRE-MRNA%REACTOME DATABASE ID RELEASE 81%72203 |
| LONG-TERM SYNAPTIC POTENTIATION%GOBP%GO:0060291 | RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 | CHROMOSOME ORGANIZATION%GOBP%GO:0051276 | HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS | RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 | PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 | MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 81%68882 | PROTEIN SYNTHESIS: CYSTEINE%PATHWHIZ%PW112918 | NCRNA PROCESSING%GOBP%GO:0034470 |
| TRANSPORT OF INORGANIC CATIONS ANIONS AND AMINO ACIDS OLIGOPEPTIDES%REACTOME%R-HSA-425393.4 | RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 | RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 | CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 | RIBOSOME BIOGENESIS%GOBP%GO:0042254 | PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 | MITOTIC METAPHASE AND ANAPHASE%REACTOME DATABASE ID RELEASE 81%2555396 | PROTEIN SYNTHESIS: SERINE%PATHWHIZ%PW120517 | CHROMOSOME ORGANIZATION%GOBP%GO:0051276 |
| SUPERPATHWAY OF CHOLESTEROL BIOSYNTHESIS%BIOCYC%PWY66-5 | RRNA METABOLIC PROCESS%GOBP%GO:0016072 | RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 | CHROMOSOME SEGREGATION%GOBP%GO:0007059 | RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 | PROTEIN SYNTHESIS: LYSINE%SMPDB%SMP0111874 | CHROMOSOME ORGANIZATION%GOBP%GO:0051276 | PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 | CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 |
| POSITIVE REGULATION OF SYNAPTIC TRANSMISSION%GOBP%GO:0050806 | RRNA PROCESSING%GOBP%GO:0006364 | RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 | DNA REPAIR%GOBP%GO:0006281 | RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 | PROTEIN SYNTHESIS: ASPARAGINE%SMPDB%SMP0111854 | HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS | PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 | MRNA METABOLIC PROCESS%GOBP%GO:0016071 |
| BLOOD GROUP SYSTEMS BIOSYNTHESIS%REACTOME%R-HSA-9033658.3 | MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 | DNA REPLICATION%GOBP%GO:0006260 | MITOTIC METAPHASE AND ANAPHASE%REACTOME DATABASE ID RELEASE 81%2555396 | MRNA METABOLIC PROCESS%GOBP%GO:0016071 | PROTEIN SYNTHESIS: SERINE%PATHWHIZ%PW120517 | CHROMOSOME SEGREGATION%GOBP%GO:0007059 | PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 | M PHASE%REACTOME%R-HSA-68886.5 |
| CALCIUM ION TRANSMEMBRANE IMPORT INTO CYTOSOL%GOBP%GO:0097553 | HALLMARK_MYC_TARGETS_V1%MSIGDBHALLMARK%HALLMARK_MYC_TARGETS_V1 | MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 | MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 81%68882 | MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 | PROTEIN SYNTHESIS: ISOLEUCINE%SMPDB%SMP0111872 | NUCLEAR CHROMOSOME SEGREGATION%GOBP%GO:0098813 | PROTEIN SYNTHESIS: TRYPTOPHAN%PATHWHIZ%PW120526 | RIBOSOME BIOGENESIS%GOBP%GO:0042254 |
The results we see here are extremely similar to the results that we got when we ran g:Profiler.
For upregulation, we see that at the beginning, we have a lot of developmental features, including tissue morphogenesis, cardiac and muscle tissue formation, structure formation and differentiation. We see that at the 40vs20 comparison, we see that all pathways are related to amino acid synthesis, which probably indicates that at this stage, the iPSCs are undergoing massive amount of protein synthesis. This could be related to the final bits of differentiation it will undergo.
I will mention, however, that the 40vs20 for the g:Profiler results are radically different, with them focussion more on ECM organization and response regulation.
I assume that this discrepancy likely comes down to the different annotation and geneset sources used, as variations in this results can affect this output.
In downregulation, the early stages seem to involve decreasing synaptic transmission, ribosomal development, DNA synthesis and repair, cell cycle elements. Interestingly, we actually see that the protein synthesis pathways seen upregulated in 40v20 are downregulated at 20v15. This likely means that there was a cellular event during differentiation where protein synthesis was heavily dis-favoured and after this event, amino acid synthesis ramped back up, presumbaly to meet the cells’ protein demands. Towards the tail end are just downregulation in chromosomal features and various metabolic pathways.
Once again, these results closely match the g:Profiler results. This is qualitative, however, since different parameters were used for both analyses, and the analyses itself is akin to apples and pears (same type of fruit-ish, but vastly different). Additionally, different reference sets were used, so this further pushes the idea that these results aren’t directly comparable.
We can proceed with our Cytoscape analysis. Cytoscape (Shannon et al. 2003) is a piece of software that allows for graph-based views of lists. It was initially developed with the bioinformatics community in mind, and allows for pathway analysis via connection nodes that cluster according to gene sets. This is a powerful, graphical way of analyzing biological data in addition to what we’ve already done.
We’ve been following a bit of a pre-built pipeline from a paper published by Ruth and colleagues(Reimand et al. 2019). First, we will modify and clean up the code she provided to format the fgsea results into a format that Cytoscape will take in. Then, we’ll make our networks using Cytoscape.
As an aside, I realize that I’ve routinely had among the longest analyses of all students in the course since I did a longer time-series. As a result, I’m focussing on three timepoints instead, 60 vs 40, 40 vs 20, and 20 vs 15. Additionally, I also made a network for 60 vs 0 to compare the differences from the beginning to the end of the analysis.
First, let’s get our GSEA results ready.
# The following code is courtesy of Ruth Isserlin
# It was updated and formatted for better readability
format_fgsea_results2 <- function(current_fgsea_results, current_ranks) {
# Calculate the rank at max
# fgsea returns the leading edge
# We need to extract the highest rank from the set to get the rank at max
calculated_rank_at_max <- apply(
current_fgsea_results,
1,
FUN=function(x){
max(which(names(current_ranks)
%in%
unlist(x[8])
)
)
}
)
# The last column is a comma separated list of genes that are found in the leading edge.
# And the column will be ignored but you have to put something in that column, might as put something that might be useful
gsea_results <- cbind(current_fgsea_results$pathway,
current_fgsea_results$pathway,
"Details",
current_fgsea_results$size,
current_fgsea_results$ES,
current_fgsea_results$NES,
current_fgsea_results$pval,
current_fgsea_results$padj,
0,
calculated_rank_at_max,
apply(current_fgsea_results,1,
FUN=function(x){
paste(unlist(x[8]), collapse=",")
}
)
)
colnames(gsea_results) <- c("name",
"description",
"GS details",
"SIZE",
"ES",
"NES",
"pval",
"padj",
"FWER",
"Rank at Max",
"leading edge genes")
return(gsea_results)
}
Now, let’s create a list of our Cytoscape-ready GSEA results
# Function to load ranks from a file
load_ranks <- function(rank_file) {
ranks <- fread(rank_file, header = TRUE, sep = "\t")
setNames(ranks$rank, ranks$GeneName)
}
# Function to process GSEA results and save them to TSV files
process_and_save_gsea_results <- function(gsea_results, rank_file, output_dir) {
# Load ranks
current_ranks <- load_ranks(rank_file)
# Format the fgsea results using the provided function
formatted_results <- as.data.frame(format_fgsea_results2(gsea_results, current_ranks))
# Filter for positive and negative results based on NES and adjusted p-value
positive_results <- formatted_results[formatted_results$NES > 0 & formatted_results$padj < 0.05, ]
negative_results <- formatted_results[formatted_results$NES < 0 & formatted_results$padj < 0.05, ]
# Generate output file paths
base_name <- gsub("\\.rnk$", "", basename(rank_file))
positive_file_path <- file.path(output_dir, paste0(base_name, "_positive.tsv"))
negative_file_path <- file.path(output_dir, paste0(base_name, "_negative.tsv"))
# Write the results to TSV files for Cytoscape import
fwrite(positive_results, positive_file_path, sep = "\t", quote = FALSE)
fwrite(negative_results, negative_file_path, sep = "\t", quote = FALSE)
}
# Specify the directories
rank_dir <- "ranks"
output_dir <- "Cytoscape_inputs"
dir.create(output_dir, showWarnings = FALSE)
# List all rank files and process them
rank_files <- list.files(rank_dir, pattern = "\\.rnk$", full.names = TRUE)
for (i in seq_along(rank_files)) {
process_and_save_gsea_results(gsea_list[[i]], rank_files[i], output_dir)
}
As a note, the following analyses were all conducted in Cytoscape. We installed the AutoAnnotate app so that we could do automatic annotations of our data.
All analyses used a default parameter of Q-value = 0.1 and Edge Cutoff = 0.375.
We’ll also display the number of nodes and edges for each
network_table <- data.frame(
condition = c("60 VS 0", "60 VS 40", "40 VS 20", "20 VS 15"),
nodes = c(974, 76, 632, 1241),
edges = c(2658, 1796, 3263, 12571)
)
# Use kable to display the table
kable(network_table, caption = "Table x: Network Analysis Results", align = 'c')
| condition | nodes | edges |
|---|---|---|
| 60 VS 0 | 974 | 2658 |
| 60 VS 40 | 76 | 1796 |
| 40 VS 20 | 632 | 3263 |
| 20 VS 15 | 1241 | 12571 |
We next used AutoAnnotate to annotate our data.
The clusterMaker App was used.
The Cluster algorithm was MCL Cluster
The Edge weight column used the similarity-coefficient.
Singleton clusters were created.
The networks were laid out such that they would try to prevent cluster overlap.
To do this, the Publication-Ready box was checked off in the EnrichmentMap panel. Legends were added as well.